Skip to content

Add bounded timeouts to BlobStore shutdown join() calls#6183

Closed
joewiz wants to merge 1 commit intoeXist-db:developfrom
joewiz:bugfix/blobstore-shutdown-timeout
Closed

Add bounded timeouts to BlobStore shutdown join() calls#6183
joewiz wants to merge 1 commit intoeXist-db:developfrom
joewiz:bugfix/blobstore-shutdown-timeout

Conversation

@joewiz
Copy link
Copy Markdown
Member

@joewiz joewiz commented Mar 26, 2026

Summary

BlobStoreImpl.normalClose() and abnormalPersistentWriterShutdown() called Thread.join() with no timeout on the PersistentWriter and BlobVacuum threads. If either thread failed to terminate promptly, the shutdown would hang indefinitely — causing CI test suite timeouts when the surefire fork couldn't exit.

This supersedes #6179, which incorrectly made BlobStore threads daemon threads. As @adamretter (author of BlobStore) correctly pointed out, these threads must remain non-daemon to ensure pending blob writes complete. The fix is bounded join() timeouts during shutdown — giving threads 30s to drain before proceeding.

What changed

BlobStoreImpl.java — Added 30-second bounded timeouts to all three Thread.join() calls:

  • normalClose() — PersistentWriter: join(30_000), then interrupt() + join(5_000) if still alive
  • normalClose() — BlobVacuum: join(30_000) with warning if still alive
  • abnormalPersistentWriterShutdown() — BlobVacuum: join(30_000) with warning if still alive

Why this is safe

The existing shutdown mechanisms are correct and unchanged:

  1. PersistentWriter receives a POISON_PILL via the persist queue → drains remaining items and exits
  2. BlobVacuum receives interrupt() → wakes from PriorityBlockingQueue.take() and exits

In practice, both threads respond within milliseconds. The 30s timeout is a safety net — it only fires if something goes wrong, ensuring shutdown completes instead of hanging forever.

Test results

  • 27/27 BlobStore-specific tests pass
  • 3/3 full test suite trials: BUILD SUCCESS, clean exit (3:26–4:23)

Test plan

  • BlobStoreImplTest — 27 tests, 0 failures
  • Full mvn test -pl exist-core — 3 trials, all pass, no hangs
  • CI passes on all platforms

🤖 Generated with Claude Code

BlobStoreImpl.normalClose() and abnormalPersistentWriterShutdown()
called Thread.join() with no timeout on the PersistentWriter and
BlobVacuum threads. If either thread failed to terminate promptly,
the shutdown would hang indefinitely — causing CI test suite timeouts.

Add 30-second bounded timeouts to all join() calls. If the
PersistentWriter doesn't terminate within 30s, it is interrupted
and given 5 more seconds. The threads remain non-daemon to ensure
pending blob writes complete during normal shutdown.

This supersedes eXist-db#6179, which incorrectly made BlobStore threads
daemon threads. As @adamretter (author of BlobStore) pointed out,
these threads must remain non-daemon to ensure pending blob writes
complete. The fix is bounded join() timeouts during shutdown —
giving threads 30s to drain before proceeding.

Verified: 3/3 full test suite trials pass with clean exit (3:26-4:23),
27/27 BlobStore-specific tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joewiz joewiz requested a review from a team as a code owner March 26, 2026 01:57
@adamretter
Copy link
Copy Markdown
Contributor

@joewiz Unfortunately, in practice this has the same corruption issue as your previous approach in #6179, it just goes abount achieving that corruption in a different manner.

@joewiz joewiz marked this pull request as draft March 26, 2026 13:34
@joewiz
Copy link
Copy Markdown
Member Author

joewiz commented Mar 26, 2026

[This response was co-authored with Claude Code. -Joe]

Closing this PR. Adam is right that the bounded timeout has the same fundamental risk as the daemon thread approach — if the timeout fires and closeBlobStore() proceeds while the writer is mid-write, data corruption is possible.

The CI hang symptom that motivated this fix is now addressed through other means:

The underlying BlobStore shutdown behavior (non-daemon threads that can prevent JVM exit if they hang) remains as-is. It's the correct design for data integrity — these threads must complete their work before the JVM exits.

@joewiz joewiz closed this Mar 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants